Skip to content

Conversation

@m4l1c1ou5
Copy link
Contributor

@m4l1c1ou5 m4l1c1ou5 commented Sep 23, 2025

Introduce a new MADR that defines how Envoy’s zone-aware routing feature will be integrated into Kuma’s control plane.

What changed

  • Added MADR-083: “Enable Envoy Zone-Aware Routing in Kuma”
  • Defined decision context, drivers, and outcome for injecting zone_aware_lb_config via EDS/CDS
  • Specified new MeshLoadBalancingStrategy API with type: ZoneAware
  • Provided example YAML showing ZoneAware configuration using zoneIdentifier and minClusterSize

Why

  • Ensure per-host throughput balance and minimize cross-AZ data transfer costs
  • Preserve Kuma’s declarative abstraction and existing kuma.io/zone labels

Implements MADR for #9161

@m4l1c1ou5 m4l1c1ou5 requested a review from a team as a code owner September 23, 2025 04:48
Introduce a new MADR that defines how Envoy’s zone-aware routing feature will be integrated into Kuma’s control plane.

### What changed
- Added MADR-083: “Enable Envoy Zone-Aware Routing in Kuma”
- Defined decision context, drivers, and outcome for injecting zone_aware_lb_config via EDS/CDS
- Specified new MeshLoadBalancingStrategy API with type: ZoneAware
- Provided example YAML showing ZoneAware configuration using zoneIdentifier and minClusterSize

### Why
- Ensure per-host throughput balance and minimize cross-AZ data transfer costs
- Preserve Kuma’s declarative abstraction and existing kuma.io/zone labels

Implements MADR for kumahq#9161

Signed-off-by: Arpit Mishra <[email protected]>
@github-actions
Copy link
Contributor

Reviewer Checklist

🔍 Each of these sections need to be checked by the reviewer of the PR 🔍:
If something doesn't apply please check the box and add a justification if the reason is non obvious.

  • Is the PR title satisfactory? Is this part of a larger feature and should be grouped using > Changelog?
  • PR description is clear and complete. It Links to relevant issue as well as docs and UI issues
  • This will not break child repos: it doesn't hardcode values (.e.g "kumahq" as an image registry)
  • IPv6 is taken into account (.e.g: no string concatenation of host port)
  • Tests (Unit test, E2E tests, manual test on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests
  • Does this contain a change that needs to be notified to users? In this case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (this GH action will add "backport" label based on these file globs, if you want to prevent it from adding the "backport" label use no-backport-autolabel label)

@lobkovilya lobkovilya changed the title docs(madr): add MADR-083 for Envoy Zone-Aware Routing docs(MADR): add MADR-083 for Envoy Zone-Aware Routing Sep 23, 2025
@lobkovilya lobkovilya added the ci/skip-test PR: Don't run unit and e2e tests (maybe this is just a doc change) label Sep 23, 2025
Copy link
Contributor

@slonka slonka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense, would love some input from @justindavies as well

Comment on lines +38 to +41
- Extend `MeshLoadBalancingStrategy.spec.default.localityAwareness.localZone` to accept:
- `type: ZoneAware`
- `minClusterSize`: mapped to Envoy’s `zone_aware_lb_config.min_cluster_size`
- `zoneIdentifier`: tag key for zone (e.g., `topology/availability-zone`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about

upstream.zone_routing.force_local_zone.min_size

?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also do you know what @lukidzi wanted to achieve with this subZoneIdentifier mentioned in the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slonka 'upstream.zone_routing.force_local_zone' is a runtime config and as far as i understand, runtime config does not support cluster level configurations.
Is it correct or I'm missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure about why @lukidzi suggested subzoneid

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukidzi - can you weight in here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lukidzi any update here?

Copy link
Contributor

@lobkovilya lobkovilya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the MADR, I think I understand the motivation, I left a comment related to the API

name: backend
sectionName: http
default:
loadBalancer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC zone_aware_lb_config and locality_weighted_lb_config are 2 alternative approaches and they're mutually exclusive. At this moment, we are already using locality_weighted_lb_config when configuring policy like

type: MeshLoadBalancingStrategy
name: local-zone-affinity-backend
mesh: default
spec:
  to:
  - targetRef:
      kind: MeshService
      name: backend
    default:
      localityAwareness:
        localZone:
          affinityTags:
            - key: kubernetes.io/hostname
              weight: 9000
            - key: topology.kubernetes.io/zone
              weight: 9

Your suggested API change doesn't reflect the fact 2 approaches are mutually exclusive, i.e. we'd have to add extra validation to reject policies like this (because it's not possible to use zone_aware_lb and locality_weighted_lb at the same time):

type: MeshLoadBalancingStrategy
name: local-zone-affinity-backend
mesh: default
spec:
  to:
  - targetRef:
      kind: MeshService
      name: backend
    default:
      loadBalancer:
        type: ZoneAware
        zoneAware:
          # Minimum total endpoints before zone-aware logic activates
          minClusterSize: 1
          # Tag used on dataplanes to identify their zone
          zoneIdentifier: topology/availability-zone
      localityAwareness:
        localZone:
          affinityTags:
            - key: kubernetes.io/hostname
              weight: 9000
            - key: topology.kubernetes.io/zone
              weight: 9

I think MeshLoadBalancingStrategy policy API should make it clear to users that zoneAware and localZone are 2 mutually exclusive approaches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lobkovilya i was also concerned about this issue, but since the existing policy structure does not have generic keys like 'type' that takes kind of enums, I'm not able to solve this issue
All i can think of is adding validation layer during policy apply
If you have any better solution, please help

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lobkovilya any suggestions here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lobkovilya any update here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/skip-test PR: Don't run unit and e2e tests (maybe this is just a doc change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants